# [trainer, worker] feat: more flexible and easy-to-use reward model #3679
## Conversation
> Does sglang …

Yes, I checked the related issues and found that the same phenomenon was mentioned in sgl-project/sglang#6367 (comment). In short, since RL normally uploads a new set of parameters, sglang simply discards the old ones to speed up. I also looked into the recent PR sgl-project/sglang#10873, which seems to add support for reusing the original weights by keeping a stored copy.
The PR seems to be included in sglang 0.5.3. |
The current reward model implementation faces the following challenges:

1. Model support: it is primarily designed for discriminative models and lacks robust support for generative reward models.
2. Complexity: it relies on heavyweight backends like FSDP or Megatron, which are often unnecessary for typical reward model inference tasks.
3. Flexibility: the batch-level synchronization mechanism hinders the implementation of more flexible, sample-level reward functions for developers.

### What this PR does

To address these issues, this PR introduces a more flexible and easy-to-use reward model design. Specifically, it implements two main classes, `RewardModelManager` and `RewardManagerWorker`, along with runnable scripts in `recipe/fapo`.

<img width="1732" height="1188" alt="image" src="https://github.com/user-attachments/assets/50fa8358-483c-44af-ba7b-3b696306c3db" />

- `RewardModelManager` first launches multiple reward servers and then adopts a router-based approach to manage them (using the [SGLang Router](https://docs.sglang.ai/advanced_features/router.html)), distributing requests across the reward servers.
- `RewardManagerWorker` retrieves the remote actor handle, giving users greater flexibility in designing custom reward functions. For example, users can easily implement a customized reward function like the following:

```python
async def compute_score(
    data_source: str,
    solution_str: str,
    ground_truth: str,
    extra_info: dict,
    reward_router_address: str,
    reward_model_tokenizer: PreTrainedTokenizer,
):
    # Compute the rule-based reward score
    rule_based_score = ...

    # Build the GRM prompts
    grm_prompts = ...
    grm_prompt_ids = ...

    # Users can directly call the reward model via a POST request to the reward router
    grm_outputs = post(f"http://{reward_router_address}/generate", ...)
    ...

    # Final reward score
    final_score = ...
    return final_score
```

This implementation exposes a `reward_model` interface inside the `compute_score` method, maximizing flexibility and convenience for algorithmic design. Note that `compute_score` is an asynchronous function, so efficiency is not a concern: each sample is processed asynchronously.
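The `post(...)` call above is left abstract. As a rough sketch of that step, the request might be issued with `aiohttp`, assuming the router forwards SGLang's native `/generate` API and accepts a JSON body with `text` and `sampling_params` fields; the helper name, payload shape, and sampling parameters are illustrative assumptions, not part of this PR.

```python
import aiohttp

async def query_reward_router(reward_router_address: str, grm_prompt: str) -> str:
    """Hypothetical helper: send one GRM prompt to the reward router and return the generation."""
    # Payload shape assumes SGLang's native /generate schema; adjust to the actual server API.
    payload = {
        "text": grm_prompt,
        "sampling_params": {"max_new_tokens": 512, "temperature": 0.0},
    }
    async with aiohttp.ClientSession() as session:
        async with session.post(
            f"http://{reward_router_address}/generate", json=payload
        ) as resp:
            resp.raise_for_status()
            result = await resp.json()
    # SGLang generate responses typically carry the generated text under "text".
    return result["text"]
```

Inside `compute_score`, the text returned by such a helper would then be parsed into a numeric GRM score and combined with the rule-based score.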
### Integration with AgentLoop
This PR introduces asynchronous reward computation for individual samples (`async def run_single(self, data: DataProto) -> dict`) and leverages an event loop to run reward computation in parallel, significantly improving processing efficiency.

Moreover, this implementation integrates with `agentloop` for further efficiency gains (already implemented):

<img width="2362" height="1280" alt="image" src="https://github.com/user-attachments/assets/4297428d-194b-4c6f-aff1-69daf02ca743" />

In this mode, the reward model operates independently from the rollout process (standalone mode), enabling a natural asynchronous data flow in which each sample undergoes reward rollout immediately after its actor rollout.

With this implementation, code redundancy in the existing reward model is reduced while flexibility for user-customized reward functions is maximized.
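As a minimal sketch of the event-loop pattern described in this section, per-sample reward calls can simply be scheduled concurrently and gathered. The `run_single` interface is quoted from the description above; the worker and sample objects here are illustrative stand-ins rather than the actual verl types.

```python
import asyncio

async def compute_rewards_concurrently(reward_worker, samples) -> list[dict]:
    # Schedule one run_single(...) coroutine per sample; the calls overlap on the
    # event loop while each one awaits its reward-router response.
    tasks = [asyncio.create_task(reward_worker.run_single(sample)) for sample in samples]
    return list(await asyncio.gather(*tasks))
```

With the `agentloop` integration, the same idea applies per sample: a reward task can be awaited as soon as that sample's actor rollout finishes, instead of waiting for the whole batch.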
### Runnable Scripts
A runnable example is provided in `recipe/fapo/`. The newly introduced parameters for this implementation are placed in `fapo/config` and will be integrated into the main codebase upon completion of the refactoring.